Multitrait genome-wide analyses identify new susceptibility loci and candidate drugs to primary sclerosing cholangitis

Primary sclerosing cholangitis (PSC) is a rare autoimmune bile duct disease that is strongly associated with immune-mediated disorders. In this study, we implemented multitrait joint analyses to genome-wide association summary statistics of PSC and numerous clinical and epidemiological traits to estimate the genetic contribution of each trait and genetic correlations between traits and to identify new lead PSC risk-associated loci. We identified seven new loci that have not been previously reported and one new independent lead variant in the previously reported locus. Functional annotation and fine-mapping nominated several potential susceptibility genes such as MANBA and IRF5. Network-based in silico drug efficacy screening provided candidate agents for further study of pharmacological effect in PSC.


GWAS summary statistics for PSC and other polygenic traits used in the present study.
We downloaded publicly available GWAS summary statistics from existing data resources, the NHGRI-EBI GWAS Catalog 1 , the MR IEU OpenGWAS data 2,3 , and the international PSC Study Group 4 . PSC summary statistics can be downloaded from the international PSC Study Group (IPSCSG; https://www.ipscsg.org/published-studies/) and the NHGRI-EBI GWAS Catalog (http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST004001-GCST005000/GCST004030). The details, including data code, category of trait, sample size, number of SNPs, reference with PMID, and download link, are shown in Supplementary Data 1.

Estimates of genetic heritability and pairwise genetic correlation using LDSR analysis.
We implemented LD score regression (LDSR; ldsc v1.0.1, https://github.com/bulik/ldsc) 5,6 to estimate genome-wide SNParray heritability (h2) representing the proportion of phenotypic variance explained by all common SNPs and to examine the shared genetic contribution (rg) of PSC against numerous polygenic traits using publicly available GWAS summary statistics and linkage disequilibrium (LD) information with European samples from 1000 Genome Project as a reference for the pattern of genome-wide LD. LDSR is a method regressing χ 2 statistics from summary statistics on LD scores, estimating genetic correlation without bias due to population stratification or cryptic relatedness 5,6 . By regressing SNP-level associations for two traits, (i.e., the product of Z scores, ZPSC× Ztrait1) and weighting each SNP by its LD score, which is an estimate of the total amount of genetic variations tagged by each variant, we can estimate the magnitude and direction of the shared genomic architecture between PSC and a tested trait. We first implemented the command option of LD Score with "munge_sumstats.py" to generate the ".sumstats" format from the summary statistics after selecting approximately 1.22M HapMap3 SNPs with minor allele frequency (MAF) > 0.01 and exclusion of multi-allelic SNPs and the major histocompatibility complex (MHC) region (Chr6:25Mb-34Mb) as recommended. The major histocompatibility complex (MHC) region was excluded from summary statistics because of the complex and extended LD pattern and genetic architecture of the MHC region 7-10 . We then applied "ldsc.py --rg PSC.sumstats.gz, trait1.sumstats.gz --ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ --out PSC_trait1".

Joint association analysis of multi-traits on PSC.
We carried out MTAG 11 (mtag v1.0.8; https://github.com/JonJala/mtag) to discover new independent PSC risk-associated loci against polygenic traits that demonstrated strong genetic correlation with PSC through LDSR analysis. MTAG 11 estimates the effect size per SNP for each trait by incorporating information included in other correlated traits by utilizing bivariate LD score regression to gauge sample overlaps between summary statistics from multiple GWAS as input in MTAG (since MTAG does not require summary statistics from independent samples 11 among the multiple GWAS). It then carries out a combined LD score regression and meta-analysis approach from the genetically related traits. MTAG aligned all alleles in all summary statistics analyzed, and SNPs not present in any summary statistics were removed further. The output from MTAG provides re-estimated effect size and P-value (P) per each SNP for each trait.

Summary statistics imputation.
Since MTAG utilizes a common set of SNPs that overlap among all tested traits, MTAG implementation with high-powered GWAS is recommended 11

Colocalization between GWAS and Expression quantitative trait locus signals from the Genotype-Tissue Expression
Based on the prediction that risk variants may exert their effects via various tissues, we surveyed all 49 tissue types available in Genotype-Tissue Expression (GTEx) 14 v8. The sample size of each tissue is described in Supplementary Data 3. All variant-gene cis-expression quantitative trait locus (eQTL) associations tested in each tissue within ±100kb windows of the lead variant presented in MTAG_PSC were extracted. Colocalization of the MTAG_PSC and eQTL signals was calculated using the LD-independent colocalization approach (coloc v5.1.0; https://cran.r-project.org/web/packages/coloc/) 15 .
Although the PSC GWAS 4 implemented population stratification analysis and focused on European-descent individuals, we applied the LD-independent approach to avoid spurious colocalizations due to the violation of common LD assumption in the GWAS summary-level data. We considered the colocalizations when coloc suggested the plausible posterior probability that both PSC and a tissue from GTEx v8 are associated and share a single function variant (PP4 > 0.80).

Gene-based functional enrichment analysis using STRING database and DAVID Bioinformatics Resources
We carried out gene-based enrichment analysis of protein-protein interaction (PPI) networks amongst 406 potential candidate genes from position mapping, eQTL mapping, and chromatin interaction mapping as well as newly 19 MTAGidentified and previously reported PSC risk-associated genes using STRING PPI networks (STRING v11.5; https://stringdb.org/cgi/input?sessionId=bmwWOuutn8ZR). We selected the setting options with the max number of interactions = 20 and the highest confidence score of 0.9 to survey the functional enrichment of numerous pathways by genes. We also utilized the Database for Annotation, Visualization, and Integrated Discovery (DAVID v6.8; https://david.ncifcrf.gov/) Bioinformatics Resources to survey the enrichment of various functional annotations (count > 2, P-value < 0.05).

Network-based proximity between drugs and disease-identified proteins for drug repurposing
To estimate a drug-disease proximity measures, distance (d) and the corresponding relative proximity (z), we implemented network-based proximity analysis for drug repurposing developed by Guney et al. 16 . The method provides a relative measure that quantifies the network-based proximity (or closeness) between drugs and disease proteins encoded by genes associated with disease 12,16 . We obtained the drug and drug target data from the DrugBank database 17